Chapter 4 Exploratory Data Analysis

4.1 Start with dplyr counts and summaries in the console

  • David Robinson often starts exploring data with simple counts in the console.

  • In the console we leave out the package name (breaking the rule I just gave you) so we can type more quickly and explore the data with dplyr verbs faster.
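
As a rough sketch of such a console session, assuming the data is the txhousing data set that ships with ggplot2 (its city, date and sales columns match the ones used in this chapter, but the data set name is an assumption here, as are the variable names):

```r
# Assumption: the chapter's data is ggplot2's built-in txhousing data set.
library(dplyr)

sales_data <- ggplot2::txhousing

# How many rows are there for each city?
city_counts <- count(sales_data, city, sort = TRUE)
city_counts

# Per year: number of rows, missing sales values and total sales.
year_summary <- sales_data %>%
  group_by(year) %>%
  summarise(
    n_rows = n(),
    n_missing_sales = sum(is.na(sales)),
    total_sales = sum(sales, na.rm = TRUE)
  )
year_summary
```

Counting the missing values early is useful because the plots later in the chapter warn about rows removed for exactly that reason.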

4.2 Plot data points with geom_point()

  • After using dplyr count(), group_by() and summarise(), try plotting all data points with ggplot2::geom_point(). It almost NEVER fails to show you what’s going on quickly and is unlikely to return errors.

  • ggplot2::geom_point() is the simplest and most reliable ggplot2 plot type (or geom) to start with.

  • Let’s look at all the values of sales for each date.
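
A minimal sketch of this first plot, again assuming the data is ggplot2's txhousing data set (an assumption, not confirmed by the text):

```r
# Assumption: the chapter's data is ggplot2's built-in txhousing data set.
sales_data <- ggplot2::txhousing

# One point per row: date on the x axis, sales on the y axis.
sales_by_date <-
  ggplot2::ggplot(
    data = sales_data,
    mapping = ggplot2::aes(x = date, y = sales)
  ) +
  ggplot2::geom_point()

# Printing draws the plot. Rows with a missing date or sales value are
# dropped with a warning rather than an error.
print(sales_by_date)
```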

## Warning: Removed 568 rows containing missing values (geom_point).

  • Now let’s look at the individual sales values over the values of the city column.
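
The same idea with city on the x axis might look like this (a sketch under the same txhousing assumption):

```r
# Assumption: the chapter's data is ggplot2's built-in txhousing data set.
sales_data <- ggplot2::txhousing

# One point per row: city on the x axis, sales on the y axis.
sales_by_city <-
  ggplot2::ggplot(
    data = sales_data,
    mapping = ggplot2::aes(x = city, y = sales)
  ) +
  ggplot2::geom_point()

print(sales_by_city)
```
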
## Warning: Removed 568 rows containing missing values (geom_point).

  • The points form very dark lines. So many of them overlap that we can’t see separate data points. This is known as overplotting. Solve this by replacing geom_point() with geom_jitter(), which randomly “jitters” the data points around so that they are less likely to overlap.

  • Sometimes there are so many data points that the jitter is not enough to reduce overplotting. We can also make the dots lighter using a parameter called alpha. The lower the value of alpha, the fainter the data points.
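
Both fixes can be combined in one layer. The sketch below assumes the txhousing data set, and the width, height and alpha values are illustrative choices rather than the author's settings:

```r
# Assumption: the chapter's data is ggplot2's built-in txhousing data set.
sales_data <- ggplot2::txhousing

sales_by_city_jittered <-
  ggplot2::ggplot(
    data = sales_data,
    mapping = ggplot2::aes(x = city, y = sales)
  ) +
  ggplot2::geom_jitter(
    width = 0.3,  # spread the points sideways within each city
    height = 0,   # no vertical jitter, so the sales values stay exact
    alpha = 0.2   # faint points: darker areas now mean more overlap
  )

print(sales_by_city_jittered)
```

Setting height = 0 is a design choice: it keeps the noise on the categorical axis only, so the plotted sales values are not distorted.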

## Warning: Removed 568 rows containing missing values (geom_point).

Hadley Wickham has a few more tricks to solve overplotting in the overplotting chapter of his ggplot2 book.

  • We all know sales of most things vary by the time of the year. So let’s now put date on the x axis, make city the colour, and because the data is over time we can join the data points using ggplot2::geom_line().

  • We’re also using the reduced data set with fewer cities so the plot is less crowded with fewer lines.
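
A sketch of the line plot follows. The reduced data set is not shown in this section, so the four cities below are a hypothetical selection from the assumed txhousing data, not the author's subset:

```r
# Assumption: the chapter's data is ggplot2's built-in txhousing data set.
sales_data <- ggplot2::txhousing

# Hypothetical reduced data set: a handful of cities chosen for illustration.
chosen_cities <- c("Austin", "Dallas", "Houston", "San Antonio")
city_sales <- dplyr::filter(sales_data, city %in% chosen_cities)

# Date on the x axis, one coloured line per city.
sales_lines <-
  ggplot2::ggplot(
    data = city_sales,
    mapping = ggplot2::aes(x = date, y = sales, colour = city)
  ) +
  ggplot2::geom_line()

print(sales_lines)
```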

## Warning: Removed 1 rows containing missing values (geom_path).

  • Beautiful. While sales have very different volumes in different cities, we can see they follow the same seasonal pattern. To bring the sales patterns closer together and make them easier to compare, we can transform the sales value by taking its log. This is Hadley Wickham’s approach in ggplot2: Elegant Graphics for Data Analysis.

  • He goes on to model the data by fitting a linear model between the log of sales and the month, then plotting the residuals (i.e. the change in sales not explained by the month). This removes the strong seasonal effects. We will take a simpler approach to reducing the seasonal effect in the final plot in this chapter by presenting the entire series zoomed out with years clearly marked.
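
The log transformation can be applied directly inside the aesthetic mapping. This sketch reuses the hypothetical four-city subset of the assumed txhousing data and shows only the log step, not the linear model:

```r
# Assumption: the chapter's data is ggplot2's built-in txhousing data set.
sales_data <- ggplot2::txhousing

# Hypothetical reduced data set: a handful of cities chosen for illustration.
chosen_cities <- c("Austin", "Dallas", "Houston", "San Antonio")
city_sales <- dplyr::filter(sales_data, city %in% chosen_cities)

# Plot log(sales) so cities with very different volumes sit closer together.
log_sales_lines <-
  ggplot2::ggplot(
    data = city_sales,
    mapping = ggplot2::aes(x = date, y = log(sales), colour = city)
  ) +
  ggplot2::geom_line()

print(log_sales_lines)
```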

## Warning: Removed 1 rows containing missing values (geom_path).

4.3 Facet by categories

  • Another logical step after showing categories by colour is to use “small multiples”. This is a fancy way of saying draw a chart for each value in one or more columns, then look at all the plots at once, usually in a grid.

  • An important setting for facets is scales = "free", which gives each small plot its own scales. This lets us more easily spot interesting differences in the patterns over time between plots.
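
A sketch of small multiples with free scales, using ggplot2::facet_wrap() on the hypothetical four-city subset of the assumed txhousing data:

```r
# Assumption: the chapter's data is ggplot2's built-in txhousing data set.
sales_data <- ggplot2::txhousing

# Hypothetical reduced data set: a handful of cities chosen for illustration.
chosen_cities <- c("Austin", "Dallas", "Houston", "San Antonio")
city_sales <- dplyr::filter(sales_data, city %in% chosen_cities)

# One small plot per city, each with its own x and y scales.
sales_facets <-
  ggplot2::ggplot(
    data = city_sales,
    mapping = ggplot2::aes(x = date, y = sales)
  ) +
  ggplot2::geom_line() +
  ggplot2::facet_wrap(facets = ggplot2::vars(city), scales = "free")

print(sales_facets)
```

With free scales the shapes are easy to compare, but the axis values differ from panel to panel, so read the volumes off each panel separately.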

## Warning: Removed 1 rows containing missing values (geom_path).

4.4 Facet interactively (trelliscopejs)

  • An interactive way to facet and explore your data with a GUI in R is trelliscopejs. Below we facet all the Texas cities in a trelliscope web page. Have a play with all the settings and see what they do.
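
A sketch of how this might be set up, assuming the txhousing data and that the trelliscopejs package is installed. The nrow and ncol values are illustrative, and the code skips the interactive step when the package is missing:

```r
# Assumption: the chapter's data is ggplot2's built-in txhousing data set.
sales_data <- ggplot2::txhousing

# An ordinary line plot of sales over time, not yet faceted.
texas_plot <-
  ggplot2::ggplot(
    data = sales_data,
    mapping = ggplot2::aes(x = date, y = sales)
  ) +
  ggplot2::geom_line()

# facet_trelliscope() takes the place of facet_wrap(). Printing the result
# builds the display and opens it in the viewer or browser.
if (requireNamespace("trelliscopejs", quietly = TRUE)) {
  texas_trellis <- texas_plot +
    trelliscopejs::facet_trelliscope(~ city, nrow = 2, ncol = 3, scales = "free")
  # print(texas_trellis)  # uncomment in an interactive session
}
```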

4.5 Loop to plot every category separately

  • To study each city as a full single chart automatically we can loop through our data.

  • We can nest a dataframe for each city into one dataframe. Then loop through each nested dataframe creating a plot for each one.

  • First we use dplyr::group_by() for city and then nest by that grouping using tidyr::nest().

  • This shows us what a nested dataframe looks like.
  • We can also view one of the nested data frames using square brackets. Think of the numbers in the square brackets like the co-ordinates in Excel. The first number is the column position and the second number is the row position.
  • We can now add a plot to each nested data frame. We use purrr::map2(), a compact way to loop over two inputs in parallel and pass each pair of values to a function. In this case the two inputs are the data set inside each row and the value of the city column.
  • Take a look at the new nested data frame with a new column added containing a plot for each city.
  • Let’s also look at the information held for one of the plots, again using values in square brackets. The code below shows you that the plot is a series of nested lists that describe every element of the plot.
  • Finally, let’s print every plot quite simply with this code.
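
The steps above can be sketched as follows. The txhousing data, the four chosen cities and the helper function plot_city() are all assumptions made for illustration, not the author's code:

```r
# Assumption: the chapter's data is ggplot2's built-in txhousing data set.
sales_data <- ggplot2::txhousing

# Hypothetical reduced data set: a handful of cities chosen for illustration.
chosen_cities <- c("Austin", "Dallas", "Houston", "San Antonio")
city_sales <- dplyr::filter(sales_data, city %in% chosen_cities)

# Group by city, then nest: one row per city with a list-column called data.
grouped_sales <- dplyr::group_by(city_sales, city)
nested_sales <- dplyr::ungroup(tidyr::nest(grouped_sales))
nested_sales

# View one nested data frame: column 2 (data), then row 1.
# Note that single brackets work the other way round: nested_sales[1, 2]
# means row 1, column 2.
nested_sales[[2]][[1]]

# Hypothetical helper: draw one city's sales over time, titled with its name.
plot_city <- function(city_data, city_name) {
  ggplot2::ggplot(
    data = city_data,
    mapping = ggplot2::aes(x = date, y = sales)
  ) +
    ggplot2::geom_line() +
    ggplot2::labs(title = city_name)
}

# map2() walks along the data and city columns together, one pair per row.
nested_sales <- dplyr::mutate(
  nested_sales,
  plot = purrr::map2(data, city, plot_city)
)

# Print every plot, one after another.
purrr::walk(nested_sales$plot, print)
```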
## [[1]] ... [[14]]
## (14 plots printed, one per list element; the fifth also gave the warning below)
## Warning: Removed 1 rows containing missing values (geom_path).

4.6 Polish your final plot

  • We now have a bare minimum Exploratory Data Analysis toolkit for exploring the data, both with dplyr counts from the console and with points and lines in ggplot.

  • We will soon be ready to select a plot that tells an interesting story we have found by exploring the data. But adding all the bells and whistles to make the plot customer or publication ready can and does take a long time. So this polish shouldn’t be part of your exploratory data analysis.

  • Also, make sure the polishing is done with the clean code style recommended earlier. It’s far quicker then to comment out or tweak the values of each part of your plot until it looks just right. Clean code is faster to iterate.

  • The plot below isn’t perfect. There may be things you want to change depending on what story you want to tell or your personal style.

  • How did I create it? By Googling for what I wanted to do (e.g. “ggplot remove axis grid lines”), copying the code from a Stack Overflow answer, and putting it into a clear structure as below.

  • Many of the tweaks or polish will be to ggplot2::theme() or ggplot2::scale… But are you really going to remember what to do each time? I try not to worry about remembering how to do it and just focus on how I want it to look and get absorbed in the creation and satisfaction of it gradually improving.

  • After you have built a few of your own charts with clear code you will soon be using your own plots as a store of code chunks to re-use.

  • Be prepared for this tweaking and polishing to take you much longer than you planned. Always.
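
As an illustration of keeping the polish in a clear structure, here is a small sketch with one plot element per line, so each can be commented out or tweaked on its own. It is not the author's final plot: the txhousing data, the chosen cities, the labels and the theme settings are all assumptions.

```r
# Assumption: the chapter's data is ggplot2's built-in txhousing data set.
sales_data <- ggplot2::txhousing

# Hypothetical reduced data set: a handful of cities chosen for illustration.
chosen_cities <- c("Austin", "Dallas", "Houston", "San Antonio")
city_sales <- dplyr::filter(sales_data, city %in% chosen_cities)

final_plot <-
  ggplot2::ggplot(
    data = city_sales,
    mapping = ggplot2::aes(x = date, y = sales, colour = city)
  ) +
  ggplot2::geom_line() +
  # Mark the years clearly (date is stored as a decimal year in txhousing).
  ggplot2::scale_x_continuous(breaks = seq(from = 2000, to = 2015, by = 5)) +
  # Illustrative labels only.
  ggplot2::labs(title = "Monthly sales by city", x = NULL, y = "Sales", colour = "City") +
  ggplot2::theme_minimal() +
  ggplot2::theme(
    panel.grid.minor = ggplot2::element_blank(),  # remove the minor grid lines
    legend.position = "bottom"
  )

print(final_plot)
```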

## label_key: city
## Saving 7 x 5 in image
## Warning: package 'gdtools' was built under R version 3.6.1
## Warning: Removed 430 rows containing missing values (geom_path).